Load multiple packages into your environment using the following code (you can add more packages to the list as needed):
knitr::opts_chunk$set(echo = TRUE)
library(pacman)
p_load(tidyverse, foreign, corrplot, stargazer, coefplot, effects, psych, ggcorrplot)
- RQ: To what extent has the decrease in the racial pay gap over time been
influenced by the different economic sources and trajectories of men and women?
- The authors are focused on the intersection of race and gender in producing
wage gaps, and their proposal of a new theoretical framework aims to establish
reasonable expectations for wage disparities between different races/genders.
Importantly, the authors try to create a framework that can account for
changes in the racial pay gap over time.
Note: you can obtain this from your “Data Cart”.
Import the dataset sat_math.dta to your R environment and examine the effect of IQ and other variables on SAT math score. Hint: use read.dta()
# loading data into environment
sat_data <- read.dta("sat_math.dta")
| Variable Name | Variable Detail |
|---|---|
| sat_math | SAT Math Score |
| female | The Female Dummy (Male = 0) |
| black/other | Two Racial Dummies (White as the Reference Group) |
| meduy | Mother’s Years of Schooling |
| feduy | Father’s Years of Schooling |
| hours | Average Weekly Study Hours |
| IQ | IQ Score (0 to 100) |
# Grouping data by gender and producing descriptive statistics
sat_data %>%
  group_by(female) %>%
  summarise(mean_satmath = mean(sat_math),
            mean_meduy = mean(meduy),
            mean_feduy = mean(feduy),
            mean_hours = mean(hours),
            mean_IQ = mean(IQ))
## Set use = "complete.obs" to ignore observations with NAs
M <- cor(sat_data, use = "complete.obs")
# Save the matrix to a dataframe, then use `ggcorrplot` to visualize
ggcorrplot(as.data.frame(M),
hc.order = TRUE,
type = "lower",
lab = TRUE)
With sat_math as the dependent variable (DV), choose one numeric independent variable (IV) that seems to have a meaningful relation to the DV based on the correlation matrix you created, and then create the following plots:
# (a) A scatter plot of the DV and IV
sat_data %>%
ggplot(aes(x = hours, y = sat_math)) +
geom_point(shape = 1, alpha = 0.7) +
labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
x = "Average Weekly Study Hours",
y = "SAT Math Score")
# (b) A scatter plot of the DV and IV with a fitted linear regression line
sat_data %>%
ggplot(aes(x = hours, y = sat_math)) +
geom_point(shape = 1, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
x = "Average Weekly Study Hours",
y = "SAT Math Score")
## `geom_smooth()` using formula 'y ~ x'
# (c) A scatter plot of the DV and IV, and each observation is color coded by gender
# Create new variable 'gender'
sat_data <- sat_data %>%
mutate(gender = ifelse(female == 1, "female", "male"))
sat_data %>% as_tibble() %>%
ggplot(aes(x = hours, y = sat_math, color = gender)) +
geom_point(shape = 1) +
labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
x = "Average Weekly Study Hours",
y = "SAT Math Score",
subtitle = "Grouped by gender")
# (d) On top of plot (c), fit a linear regression line for each gender group, the lines should also be color coded
sat_data %>% as_tibble() %>%
ggplot(aes(x = hours, y = sat_math, color = gender)) +
geom_point(shape = 1) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
x = "Average Weekly Study Hours",
y = "SAT Math Score",
subtitle = "Grouped by gender")
## `geom_smooth()` using formula 'y ~ x'
What are your preliminary findings/reflections on the data based on the descriptive statistics, the correlation matrix, and the scatter plots?
- Based on the descriptive statistics, students overall spend an average of 39 hours studying per week, and males and females spend roughly the same amount of time studying each week. On average, females score 47 points higher than males on SAT math, while males score about 2 points higher on IQ. The data are also somewhat skewed to the right (median < mean), which means that fewer students reach the higher SAT math scores than would if the scores were normally distributed. Based on the correlation plot, IQ has the strongest positive correlation with SAT math score (0.65). Parents’ years of schooling are also strongly correlated with SAT math score, more so than the remaining explanatory variables. For my variable of interest (hours), there appears to be a very weak negative relationship between average weekly study hours and SAT math score (-0.03). The scatter plots confirm this weak relationship, especially those with regression lines: while females tend to score higher, average weekly study hours are not strongly correlated with SAT math score.
What other exploratory data analysis will be useful for you to better understand the data before modeling? Please implement some additional exploratory data analysis and discuss your preliminary findings.
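One possible direction, sketched here under stated assumptions: compare the mean and median of the DV to check its skew, count missing values, and look at the DV within gender and race groups. Because sat_math.dta is not reproduced in this document, the block builds a small hypothetical stand-in data frame (`demo`) with made-up values and the same column names; the numbers it prints are not findings from the assignment data. Swap `demo` for `sat_data` to run it on the real file.

```r
# Hypothetical stand-in for sat_data (made-up values, same column names),
# so the sketch runs without sat_math.dta. Replace `demo` with `sat_data`.
set.seed(123)
n <- 200
demo <- data.frame(
  female = rbinom(n, 1, 0.5),
  black  = rbinom(n, 1, 0.2),
  IQ     = round(runif(n, 20, 80)),
  hours  = round(rnorm(n, 39, 6))
)
demo$sat_math <- 300 + 4 * demo$IQ + 50 * demo$female + rnorm(n, sd = 50)

# (1) Center and spread of the DV; a mean above the median suggests right skew
dv_summary <- c(mean   = mean(demo$sat_math),
                median = median(demo$sat_math),
                sd     = sd(demo$sat_math))
print(round(dv_summary, 1))

# (2) Missing values per column, i.e. what use = "complete.obs" would drop
na_counts <- colSums(is.na(demo))
print(na_counts)

# (3) Group size and mean DV for each female x black combination
group_stats <- aggregate(sat_math ~ female + black, data = demo,
                         FUN = function(x) c(n = length(x), mean = mean(x)))
print(group_stats)

# (4) Distribution of the DV overall and by gender
hist(demo$sat_math, main = "Distribution of SAT Math Score",
     xlab = "SAT Math Score")
boxplot(sat_math ~ female, data = demo, names = c("male", "female"),
        xlab = "Gender", ylab = "SAT Math Score")
```

Small group sizes in step (3) would be a reason to read the race dummy coefficients cautiously once the models are fitted.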
Fit five regression models with sat_math as the DV and report regression results in a table using stargazer() from the stargazer package.
# Creating the five models
m1 <- lm(sat_math ~ IQ, data = sat_data)
m2 <- lm(sat_math ~ IQ + female + black + other, data = sat_data)
m3 <- lm(sat_math ~ IQ + female + black + other + feduy + meduy, data = sat_data)
m4 <- lm(sat_math ~ IQ + female + black + other + feduy + meduy + hours, data = sat_data)
m5 <- lm(sat_math ~ IQ + female + black + other + feduy + meduy + hours + IQ * female, data = sat_data)
stargazer(m1, m2, m3, m4, m5, type = "text")
##
## =====================================================================================================================================================
##                      Dependent variable:
##                      --------------------------------------------------------------------------------------------------------------------------------
##                      sat_math
##                      (1)                       (2)                       (3)                       (4)                       (5)
## -----------------------------------------------------------------------------------------------------------------------------------------------------
## IQ                   4.211***                  4.356***                  3.487***                  3.484***                  2.893***
##                      (0.154)                   (0.138)                   (0.132)                   (0.132)                   (0.175)
##
## female                                         53.831***                 51.693***                 51.597***                 -9.552
##                                                (3.565)                   (3.142)                   (3.143)                   (12.357)
##
## black                                          -16.130***                -14.455***                -14.519***                -15.324***
##                                                (4.432)                   (3.906)                   (3.906)                   (3.861)
##
## other                                          -10.809*                  -6.918                    -6.825                    -6.152
##                                                (6.005)                   (5.294)                   (5.295)                   (5.231)
##
## feduy                                                                    5.725***                  5.747***                  5.703***
##                                                                          (0.474)                   (0.475)                   (0.469)
##
## meduy                                                                    6.434***                  6.434***                  6.379***
##                                                                          (0.491)                   (0.491)                   (0.485)
##
## hours                                                                                              -0.264                    -0.255
##                                                                                                    (0.251)                   (0.248)
##
## IQ:female                                                                                                                    1.232***
##                                                                                                                              (0.241)
##
## Constant             315.605***                286.110***                184.937***                195.423***                226.197***
##                      (7.906)                   (7.519)                   (8.921)                   (13.391)                  (14.530)
##
## -----------------------------------------------------------------------------------------------------------------------------------------------------
## Observations         1,000                     1,000                     1,000                     1,000                     1,000
## R2                   0.428                     0.543                     0.646                     0.647                     0.656
## Adjusted R2          0.427                     0.541                     0.644                     0.644                     0.653
## Residual Std. Error  62.734 (df = 998)         56.156 (df = 995)         49.452 (df = 993)         49.450 (df = 992)         48.835 (df = 991)
## F Statistic          746.637*** (df = 1; 998)  295.587*** (df = 4; 995)  302.435*** (df = 6; 993)  259.415*** (df = 7; 992)  236.007*** (df = 8; 991)
## =====================================================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
(Intercept) and IQ row of the modeling results?
coefplot(m5, intercept = FALSE, innerCI = 1.96, outerCI = 1.96, color = "black", title = "Coefficient Plot of Model 5")
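To make the interaction in Model 5 concrete, the implied IQ slope can be worked out separately for each gender from the reported coefficients. The sketch below hard-codes three Model 5 estimates from the stargazer table; the names `b_iq`, `b_female`, `b_int`, and `gender_gap` are labels chosen for illustration. With the fitted object available, `coef(m5)` would supply the same numbers.

```r
# Model 5 estimates copied from the stargazer table above
b_iq     <- 2.893   # IQ: slope when female = 0 (males)
b_female <- -9.552  # female: female-male gap at IQ = 0
b_int    <- 1.232   # IQ:female: extra IQ slope for females

# IQ slope by gender
slope_male   <- b_iq
slope_female <- b_iq + b_int

# Predicted female-minus-male gap in sat_math at a given IQ,
# holding the other covariates fixed
gender_gap <- function(iq) b_female + b_int * iq

print(c(male = slope_male, female = slope_female))
print(gender_gap(c(0, 50, 100)))
```

Under this model each additional IQ point is associated with about 2.9 more SAT math points for males and about 4.1 for females. The female coefficient of -9.552 is the gap only at IQ = 0. At IQ = 50 the implied gap is about 52 points, in line with the female coefficients in Models 2 through 4. The (Intercept) of 226.197 is likewise the prediction when every regressor equals zero, a combination that may lie outside the observed data.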
Simulation is a fun and effective way to learn about statistical inference. You will get a better understanding of how each population parameter affects the shape of the distribution.
Now that we have learned how to identify interactions in a given sample, you can try simulating data whose true data generating process involves an interaction between two variables. For example, you can try to reproduce a scatter plot similar to the one we saw in class (the right panel) by simulating data whose variables have such associations:
Or, you can try to reproduce a scatter plot that demonstrates Simpson’s paradox:
Note: Your output does not need to replicate the exact layout of the example graphs. You will get extra credit as long as you generate a similar graph that clearly illustrates the relationship (either a positive or negative interaction, or Simpson’s paradox). Remember to use set.seed() for any random process.
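As a starting point, here is a minimal sketch of both simulations in base R. All coefficients, group centers, sample sizes, and variable names (`sim_int`, `sim_sp`, and so on) are made-up choices for illustration, not values from class. The first part builds a data generating process with a positive interaction, and the second builds groups whose within-group slope is negative while the pooled slope is positive.

```r
set.seed(2024)  # fix the random draws so the result is reproducible
n <- 500

# Part 1: interaction, where the slope of y on x differs by group
group <- rbinom(n, 1, 0.5)   # 0/1 group indicator
x     <- runif(n, 0, 10)
# True data generating process (coefficients are arbitrary choices)
y     <- 10 + 1 * x + 2 * group + 3 * x * group + rnorm(n, sd = 3)
sim_int <- data.frame(x, y, group = factor(group))

fit_int <- lm(y ~ x * group, data = sim_int)
print(round(coef(fit_int), 2))  # the x:group1 estimate should be near 3

cols <- c("steelblue", "tomato")
plot(y ~ x, data = sim_int, col = cols[sim_int$group],
     main = "Simulated positive interaction")
for (g in levels(sim_int$group)) {
  abline(lm(y ~ x, data = subset(sim_int, group == g)),
         col = cols[as.integer(g) + 1])
}

# Part 2: Simpson's paradox, negative within groups but positive overall
grp    <- rep(1:3, each = 100)
center <- c(2, 6, 10)[grp]   # group centers rise in both x and y
x2     <- center + rnorm(length(grp), sd = 1)
y2     <- 2 * center - 1 * (x2 - center) + rnorm(length(grp), sd = 0.5)
sim_sp <- data.frame(x = x2, y = y2, grp = factor(grp))

pooled_slope <- coef(lm(y ~ x, data = sim_sp))["x"]
within_slope <- sapply(split(sim_sp, sim_sp$grp),
                       function(d) coef(lm(y ~ x, data = d))["x"])
print(pooled_slope)
print(within_slope)

plot(y ~ x, data = sim_sp, col = as.integer(sim_sp$grp) + 1,
     main = "Simulated Simpson's paradox")
abline(lm(y ~ x, data = sim_sp), lwd = 2)  # pooled fit ignores the groups
```

Changing the sign of the `3 * x * group` term gives a negative interaction instead. In the second part, the paradox comes from the group centers moving in the opposite direction to the within-group slope.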